Computation and Language 55
☆ Improved Baselines with Visual Instruction Tuning
Large multimodal models (LMM) have recently shown encouraging progress with
visual instruction tuning. In this note, we show that the fully-connected
vision-language cross-modal connector in LLaVA is surprisingly powerful and
data-efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA
data with simple response formatting prompts, we establish stronger baselines
that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint
uses merely 1.2M publicly available data, and finishes full training in ~1 day
on a single 8-A100 node. We hope this can make state-of-the-art LMM research
more accessible. Code and model will be publicly available.
comment: Tech report, 4 pages. LLaVA project page: https://llava-vl.github.io
☆ MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, Hongsheng Li
The recently released GPT-4 Code Interpreter has demonstrated remarkable
proficiency in solving challenging math problems, primarily attributed to its
ability to seamlessly reason with natural language, generate code, execute
code, and continue reasoning based on the execution output. In this paper, we
present a method to fine-tune open-source language models, enabling them to use
code for modeling and deriving math equations and, consequently, enhancing
their mathematical reasoning abilities. We propose a method of generating novel
and high-quality datasets with math problems and their code-based solutions,
referred to as MathCodeInstruct. Each solution interleaves natural language,
code, and execution results. We also introduce a customized supervised
fine-tuning and inference approach. This approach yields the MathCoder models,
a family of models capable of generating code-based solutions for solving
challenging math problems. Impressively, the MathCoder models achieve
state-of-the-art scores among open-source LLMs on the MATH (45.2%) and GSM8K
(83.9%) datasets, substantially outperforming other open-source alternatives.
Notably, the MathCoder model not only surpasses ChatGPT-3.5 and PaLM-2 on GSM8K
and MATH but also outperforms GPT-4 on the competition-level MATH dataset. The
dataset and models will be released at https://github.com/mathllm/MathCoder.
comment: The state-of-the-art open-source language models for mathematical
reasoning
☆ Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
Recent research has shown that independently trained encoders and decoders,
combined through a shared fixed-size representation, can achieve competitive
performance in speech-to-text translation. In this work, we show that this type
of approach can be further improved with multilingual training. We observe
significant improvements in zero-shot cross-modal speech translation, even
outperforming a supervised approach based on XLSR for several languages.
☆ A Long Way to Go: Investigating Length Correlations in RLHF
Great successes have been reported using Reinforcement Learning from Human
Feedback (RLHF) to align large language models. Open-source preference datasets
and reward models have enabled wider experimentation beyond generic chat
settings, particularly to make systems more "helpful" for tasks like web
question answering, summarization, and multi-turn dialogue. When optimizing for
helpfulness, RLHF has been consistently observed to drive models to produce
longer outputs. This paper demonstrates that optimizing for response length is
a significant factor behind RLHF's reported improvements in these settings.
First, we study the relationship between reward and length for reward models
trained on three open-source preference datasets for helpfulness. Here, length
correlates strongly with reward, and improvements in reward score are driven in
large part by shifting the distribution over output lengths. We then explore
interventions during both RL and reward model learning to see if we can achieve
the same downstream improvements as RLHF without increasing length. While our
interventions mitigate length increases, they aren't uniformly effective across
settings. Furthermore, we find that even running RLHF with a reward based
solely on length can reproduce most of the downstream improvements over the
initial policy model, showing that reward models in these settings have a long
way to go.
comment: 20 pages, 12 figures
☆ DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts
The ML community is rapidly exploring techniques for prompting language
models (LMs) and for stacking them into pipelines that solve complex tasks.
Unfortunately, existing LM pipelines are typically implemented using hard-coded
"prompt templates", i.e. lengthy strings discovered via trial and error. Toward
a more systematic approach for developing and optimizing LM pipelines, we
introduce DSPy, a programming model that abstracts LM pipelines as text
transformation graphs, i.e. imperative computational graphs where LMs are
invoked through declarative modules. DSPy modules are parameterized, meaning
they can learn (by creating and collecting demonstrations) how to apply
compositions of prompting, finetuning, augmentation, and reasoning techniques.
We design a compiler that will optimize any DSPy pipeline to maximize a given
metric. We conduct two case studies, showing that succinct DSPy programs can
express and optimize sophisticated LM pipelines that reason about math word
problems, tackle multi-hop retrieval, answer complex questions, and control
agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and
llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot
prompting (generally by over 25% and 65%, respectively) and pipelines with
expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top
of that, DSPy programs compiled to open and relatively small LMs like
770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely
on expert-written prompt chains for proprietary GPT-3.5. DSPy is available at
https://github.com/stanfordnlp/dspy
☆ Agent Instructs Large Language Models to be General Zero-Shot Reasoners
We introduce a method to improve the zero-shot reasoning abilities of large
language models on general language understanding tasks. Specifically, we build
an autonomous agent to instruct the reasoning process of large language models.
We show this approach further unleashes the zero-shot reasoning abilities of
large language models to more tasks. We study the performance of our method on
a wide set of datasets spanning generation, classification, and reasoning. We
show that our method generalizes to most tasks and obtains state-of-the-art
zero-shot performance on 20 of the 29 datasets that we evaluate. For instance,
our method boosts the performance of state-of-the-art large language models by
a large margin, including Vicuna-13b (13.3%), Llama-2-70b-chat (23.2%), and
GPT-3.5 Turbo (17.0%). Compared to zero-shot chain of thought, our improvement
in reasoning is striking, with an average increase of 10.5%. With our method,
Llama-2-70b-chat outperforms zero-shot GPT-3.5 Turbo by 10.2%.
☆ Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Optimizing large language models (LLMs) for downstream use cases often
involves the customization of pre-trained LLMs through further fine-tuning.
Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5
Turbo on custom datasets also encourage this practice. But, what are the safety
costs associated with such custom fine-tuning? We note that while existing
safety alignment infrastructures can restrict harmful behaviors of LLMs at
inference time, they do not cover safety risks when fine-tuning privileges are
extended to end-users. Our red teaming studies find that the safety alignment
of LLMs can be compromised by fine-tuning with only a few adversarially
designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety
guardrails by fine-tuning it on only 10 such examples at a cost of less than
$0.20 via OpenAI's APIs, making the model responsive to nearly any harmful
instructions. Disconcertingly, our research also reveals that, even without
malicious intent, simply fine-tuning with benign and commonly used datasets can
also inadvertently degrade the safety alignment of LLMs, though to a lesser
extent. These findings suggest that fine-tuning aligned LLMs introduces new
safety risks that current safety infrastructures fall short of addressing --
even if a model's initial safety alignment is impeccable, it is not necessarily
to be maintained after custom fine-tuning. We outline and critically analyze
potential mitigations and advocate for further research efforts toward
reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
☆ DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
In recent years, many interpretability methods have been proposed to help
interpret the internal states of Transformer-models, at different levels of
precision and complexity. Here, to analyze encoder-decoder Transformers, we
propose a simple, new method: DecoderLens. Inspired by the LogitLens (for
decoder-only Transformers), this method involves allowing the decoder to
cross-attend representations of intermediate encoder layers instead of using
the final encoder output, as is normally done in encoder-decoder models. The
method thus maps previously uninterpretable vector representations to
human-interpretable sequences of words or symbols. We report results from the
DecoderLens applied to models trained on question answering, logical reasoning,
speech recognition and machine translation. The DecoderLens reveals several
specific subtasks that are solved at low or intermediate layers, shedding new
light on the information flow inside the encoder component of this important
class of models.
☆ GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction
Large Language Models (LLMs) combined with instruction tuning have made
significant progress when generalizing to unseen tasks. However, they have been
less successful in Information Extraction (IE), lagging behind task-specific
models. Typically, IE tasks are characterized by complex annotation guidelines
which describe the task and give examples to humans. Previous attempts to
leverage such information have failed, even with the largest models, as they
are not able to follow the guidelines out-of-the-box. In this paper we propose
GoLLIE (Guideline-following Large Language Model for IE), a model able to
improve zero-shot results on unseen IE tasks by virtue of being fine-tuned to
comply with annotation guidelines. Comprehensive evaluation empirically
demonstrates that GoLLIE is able to generalize to and follow unseen guidelines,
outperforming previous attempts at zero-shot information extraction. The
ablation study shows that detailed guidelines is key for good results.
☆ MapperGPT: Large Language Models for Linking and Mapping Entities
Nicolas Matentzoglu, J. Harry Caufield, Harshad B. Hegde, Justin T. Reese, Sierra Moxon, Hyeongsik Kim, Nomi L. Harris, Melissa A Haendel, Christopher J. Mungall
Aligning terminological resources, including ontologies, controlled
vocabularies, taxonomies, and value sets is a critical part of data integration
in many domains such as healthcare, chemistry, and biomedical research. Entity
mapping is the process of determining correspondences between entities across
these resources, such as gene identifiers, disease concepts, or chemical entity
identifiers. Many tools have been developed to compute such mappings based on
common structural features and lexical information such as labels and synonyms.
Lexical approaches in particular often provide very high recall, but low
precision, due to lexical ambiguity. As a consequence of this, mapping efforts
often resort to a labor intensive manual mapping refinement through a human
curator.
Large Language Models (LLMs), such as the ones employed by ChatGPT, have
generalizable abilities to perform a wide range of tasks, including
question-answering and information extraction. Here we present MapperGPT, an
approach that uses LLMs to review and refine mapping relationships as a
post-processing step, in concert with existing high-recall methods that are
based on lexical and structural heuristics.
We evaluated MapperGPT on a series of alignment tasks from different domains,
including anatomy, developmental biology, and renal diseases. We devised a
collection of tasks that are designed to be particularly challenging for
lexical methods. We show that when used in combination with high-recall
methods, MapperGPT can provide a substantial improvement in accuracy, beating
state-of-the-art (SOTA) methods such as LogMap.
☆ TRAM: Bridging Trust Regions and Sharpness Aware Minimization ICLR 2024
By reducing the curvature of the loss surface in the parameter space,
Sharpness-aware minimization (SAM) yields widespread robustness improvement
under domain transfer. Instead of focusing on parameters, however, this work
considers the transferability of representations as the optimization target for
out-of-domain generalization in a fine-tuning setup. To encourage the retention
of transferable representations, we consider trust region-based fine-tuning
methods, which exploit task-specific skills without forgetting task-agnostic
representations from pre-training. We unify parameter- and representation-space
smoothing approaches by using trust region bounds to inform SAM-style
regularizers on both of these optimization surfaces. We propose Trust Region
Aware Minimization (TRAM), a fine-tuning algorithm that optimizes for flat
minima and smooth, informative representations without forgetting pre-trained
structure. We find that TRAM outperforms both sharpness-aware and trust
region-based optimization methods on cross-domain language modeling and
cross-lingual transfer, where robustness to domain transfer and representation
generality are critical for success. TRAM establishes a new standard in
training generalizable models with minimal additional computation.
comment: 17 pages, 11 tables, 1 figure. Submitted to ICLR 2024
☆ Evaluating Self-Supervised Speech Representations for Indigenous American Languages
The application of self-supervision to speech representation learning has
garnered significant interest in recent years, due to its scalability to large
amounts of unlabeled data. However, much progress, both in terms of
pre-training and downstream evaluation, has remained concentrated in
monolingual models that only consider English. Few models consider other
languages, and even fewer consider indigenous ones. In our submission to the
New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR
corpus for Quechua, an indigenous South American Language. We benchmark the
efficacy of large SSL models on Quechua, along with 6 other indigenous
languages such as Guarani and Bribri, on low-resource ASR. Our results show
surprisingly strong performance by state-of-the-art SSL models, showing the
potential generalizability of large-scale models to real-world data.
☆ CLEVRER-Humans: Describing Physical and Causal Events the Human Way NeurIPS 2022
Building machines that can reason about physical events and their causal
relationships is crucial for flexible interaction with the physical world.
However, most existing physical and causal reasoning benchmarks are exclusively
based on synthetically generated events and synthetic natural language
descriptions of causal relationships. This design brings up two issues. First,
there is a lack of diversity in both event types and natural language
descriptions; second, causal relationships based on manually-defined heuristics
are different from human judgments. To address both shortcomings, we present
the CLEVRER-Humans benchmark, a video reasoning dataset for causal judgment of
physical events with human labels. We employ two techniques to improve data
collection efficiency: first, a novel iterative event cloze task to elicit a
new representation of events in videos, which we term Causal Event Graphs
(CEGs); second, a data augmentation technique based on neural language
generative models. We convert the collected CEGs into questions and answers to
be consistent with prior work. Finally, we study a collection of baseline
approaches for CLEVRER-Humans question-answering, highlighting the great
challenges set forth by our benchmark.
comment: NeurIPS 2022 (Dataset and Benchmark Track). First two authors
contributed equally. Project page:
https://sites.google.com/stanford.edu/clevrer-humans/home
☆ Redefining Digital Health Interfaces with Large Language Models
Digital health tools have the potential to significantly improve the delivery
of healthcare services. However, their use remains comparatively limited due,
in part, to challenges surrounding usability and trust. Recently, Large
Language Models (LLMs) have emerged as general-purpose models with the ability
to process complex information and produce human-quality text, presenting a
wealth of potential applications in healthcare. Directly applying LLMs in
clinical settings is not straightforward, with LLMs susceptible to providing
inconsistent or nonsensical answers. We demonstrate how LLMs can utilize
external tools to provide a novel interface between clinicians and digital
technologies. This enhances the utility and practical impact of digital
healthcare tools and AI models while addressing current issues with using LLM
in clinical settings such as hallucinations. We illustrate our approach with
examples from cardiovascular disease and diabetes risk prediction, highlighting
the benefit compared to traditional interfaces for digital tools.
☆ Towards Robust and Generalizable Training: An Empirical Study of Noisy Slot Filling for Input Perturbations
Jiachi Liu, Liwen Wang, Guanting Dong, Xiaoshuai Song, Zechen Wang, Zhengyang Wang, Shanglin Lei, Jinzheng Zhao, Keqing He, Bo Xiao, Weiran Xu
In real dialogue scenarios, as there are unknown input noises in the
utterances, existing supervised slot filling models often perform poorly in
practical applications. Even though there are some studies on noise-robust
models, these works are only evaluated on rule-based synthetic datasets, which
is limiting, making it difficult to promote the research of noise-robust
methods. In this paper, we introduce a noise robustness evaluation dataset
named Noise-SF for slot filling task. The proposed dataset contains five types
of human-annotated noise, and all those noises are exactly existed in real
extensive robust-training methods of slot filling into the proposed framework.
By conducting exhaustive empirical evaluation experiments on Noise-SF, we find
that baseline models have poor performance in robustness evaluation, and the
proposed framework can effectively improve the robustness of models. Based on
the empirical experimental results, we make some forward-looking suggestions to
fuel the research in this direction. Our dataset Noise-SF will be released at
https://github.com/dongguanting/Noise-SF.
comment: Working in progress
☆ Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation ACL
Training monolingual language models for low and mid-resource languages is
made challenging by limited and often inadequate pretraining data. In this
study, we propose a novel model conversion strategy to address this issue,
adapting high-resources monolingual language models to a new target language.
By generalizing over a word translation dictionary encompassing both the source
and target languages, we map tokens from the target tokenizer to semantically
similar tokens from the source language tokenizer. This one-to-many token
mapping improves tremendously the initialization of the embedding table for the
target language. We conduct experiments to convert high-resource models to mid-
and low-resource languages, namely Dutch and Frisian. These converted models
achieve a new state-of-the-art performance on these languages across all sorts
of downstream tasks. By reducing significantly the amount of data and time
required for training state-of-the-art models, our novel model conversion
strategy has the potential to benefit many languages worldwide.
comment: As first reviewed at TACL
☆ Controllable Multi-document Summarization: Coverage & Coherence Intuitive Policy with Large Language Model Based Rewards
Memory-efficient large language models are good at refining text input for
better readability. However, controllability is a matter of concern when it
comes to text generation tasks with long inputs, such as multi-document
summarization. In this work, we investigate for a generic controllable approach
for multi-document summarization that leverages the capabilities of LLMs to
refine the text. In particular, we train a controllable content extraction
scheme to extract the text that will be refined by an LLM. The scheme is
designed with a novel coverage and coherence intuitive policy, which is duly
rewarded by a passively trained LLM. Our approach yields competitive results in
the evaluation using ROUGE metrics and outperforms potential baselines in
coherence, as per human evaluation.
☆ The North System for Formosa Speech Recognition Challenge 2023
This report provides a concise overview of the proposed North system, which
aims to achieve automatic word/syllable recognition for Taiwanese Hakka
(Sixian). The report outlines three key components of the system: the
acquisition, composition, and utilization of the training data; the
architecture of the model; and the hardware specifications and operational
statistics. The demonstration of the system can be found at
https://asrvm.iis.sinica.edu.tw/hakka_sixian.
☆ Neural Language Model Pruning for Automatic Speech Recognition
We study model pruning methods applied to Transformer-based neural network
language models for automatic speech recognition. We explore three aspects of
the pruning frame work, namely criterion, method and scheduler, analyzing their
contribution in terms of accuracy and inference speed. To the best of our
knowledge, such in-depth analyses on large-scale recognition systems has not
been reported in the literature. In addition, we propose a variant of low-rank
approximation suitable for incrementally compressing models, and delivering
multiple models with varied target sizes. Among other results, we show that a)
data-driven pruning outperforms magnitude-driven in several scenarios; b)
incremental pruning achieves higher accuracy compared to one-shot pruning,
especially when targeting smaller sizes; and c) low-rank approximation presents
the best trade-off between size reduction and inference speed-up for moderate
compression.
comment: 8 pages, 3 figures
☆ LLM Based Multi-Document Summarization Exploiting Main-Event Biased Monotone Submodular Content Extraction
Multi-document summarization is a challenging task due to its inherent
subjective bias, highlighted by the low inter-annotator ROUGE-1 score of 0.4
among DUC-2004 reference summaries. In this work, we aim to enhance the
objectivity of news summarization by focusing on the main event of a group of
related news documents and presenting it coherently with sufficient context.
Our primary objective is to succinctly report the main event, ensuring that the
summary remains objective and informative. To achieve this, we employ an
extract-rewrite approach that incorporates a main-event biased
monotone-submodular function for content selection. This enables us to extract
the most crucial information related to the main event from the document
cluster. To ensure coherence, we utilize a fine-tuned Language Model (LLM) for
rewriting the extracted content into a coherent text. The evaluation using
objective metrics and human evaluators confirms the effectiveness of our
approach, as it surpasses potential baselines, demonstrating excellence in both
content coverage, coherence, and informativeness.
☆ Procedural Text Mining with Large Language Models
Recent advancements in the field of Natural Language Processing, particularly
the development of large-scale language models that are pretrained on vast
amounts of knowledge, are creating novel opportunities within the realm of
Knowledge Engineering. In this paper, we investigate the usage of large
language models (LLMs) in both zero-shot and in-context learning settings to
tackle the problem of extracting procedures from unstructured PDF text in an
incremental question-answering fashion. In particular, we leverage the current
state-of-the-art GPT-4 (Generative Pre-trained Transformer 4) model,
accompanied by two variations of in-context learning that involve an ontology
with definitions of procedures and steps and a limited number of samples of
few-shot learning. The findings highlight both the promise of this approach and
the value of the in-context learning customisations. These modifications have
the potential to significantly address the challenge of obtaining sufficient
training data, a hurdle often encountered in deep learning-based Natural
Language Processing techniques for procedure extraction.
comment: 8 pages, 4 figures, Accepted to The Twelfth International Conference
on Knowledge Capture (K-Cap 2023)
☆ Evaluating Hallucinations in Chinese Large Language Models
Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, Xipeng Qiu
In this paper, we establish a benchmark named HalluQA (Chinese Hallucination
Question-Answering) to measure the hallucination phenomenon in Chinese large
language models. HalluQA contains 450 meticulously designed adversarial
questions, spanning multiple domains, and takes into account Chinese historical
culture, customs, and social phenomena. During the construction of HalluQA, we
consider two types of hallucinations: imitative falsehoods and factual errors,
and we construct adversarial samples based on GLM-130B and ChatGPT. For
evaluation, we design an automated evaluation method using GPT-4 to judge
whether a model output is hallucinated. We conduct extensive experiments on 24
large language models, including ERNIE-Bot, Baichuan2, ChatGLM, Qwen, SparkDesk
and etc. Out of the 24 models, 18 achieved non-hallucination rates lower than
50%. This indicates that HalluQA is highly challenging. We analyze the primary
types of hallucinations in different types of models and their causes.
Additionally, we discuss which types of hallucinations should be prioritized
for different types of models.
comment: Work in progress
☆ Reformulating Domain Adaptation of Large Language Models as Adapt-Retrieve-Revise ICLR 2024
While large language models (LLMs) like GPT-4 have recently demonstrated
astonishing zero-shot capabilities in general domain tasks, they often generate
content with hallucinations in specific domains such as Chinese law, hindering
their application in these areas. This is typically due to the absence of
training data that encompasses such a specific domain, preventing GPT-4 from
acquiring in-domain knowledge. A pressing challenge is that it's not plausible
to continue training LLMs of such scale on in-domain data.
This paper introduces a simple and effective domain adaptation framework for
GPT-4 by reformulating generation as an \textbf{adapt-retrieve-revise} process.
The initial step is to \textbf{adapt} an affordable 7B LLM to the target domain
by continuing learning on in-domain data. When solving a task, we leverage the
adapted LLM to generate a draft answer given a task query. Then, the draft
answer will be used to \textbf{retrieve} supporting evidence candidates from an
external in-domain knowledge base. Finally, the draft answer and retrieved
evidence are concatenated into a whole prompt to let GPT-4 assess the evidence
and \textbf{revise} the draft answer to generate the final answer.
Our proposal combines the advantages of the efficiency of adapting a smaller
7B model with the evidence-assessing capability of GPT-4 and effectively
prevents GPT-4 from generating hallucinatory content. In the zero-shot setting
of four Chinese legal tasks, our method improves accuracy by 33.3\% compared to
the direct generation by GPT-4. When compared to two stronger retrieval-based
baselines, our method outperforms them by 15.4\% and 23.9\%. Our code will be
released
comment: Under submission to ICLR 2024
☆ Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning
Exploiting large language models (LLMs) to tackle deductive reasoning has
garnered growing attention. It still remains highly challenging to achieve
satisfactory results in complex deductive problems, characterized by plenty of
premises (i.e., facts or rules) entailing intricate relationships among
entities and requiring multi-hop reasoning. One intuitive solution is to
decompose the original task into smaller sub-tasks, and then chain the multiple
casual reasoning steps together in a forward (e.g., Selection-Inference) or
backward (e.g., LAMBADA) direction. However, these techniques inevitably
necessitate a large number of overall stages, leading to computationally
expensive operations and a higher possibility of making misleading steps. In
addition to stage-by-stage decomposition, we draw inspiration from another
aspect of human problem-solving. Humans tend to distill the most relevant
information and organize their thoughts systematically (e.g., creating mind
maps), which assists them in answering questions or drawing conclusions
precisely and quickly. In light of this, we propose a novel reasoning approach
named Concise and Organized Perception (COP). COP carefully analyzes the given
statements to efficiently identify the most pertinent information while
eliminating redundancy. It then prompts the LLMs in a more organized form that
adapts to the model's inference process. By perceiving concise and organized
proofs, the deductive reasoning abilities of LLMs can be better elicited, and
the risk of acquiring errors caused by excessive reasoning stages is mitigated.
Furthermore, our approach can be combined with the aforementioned ones to
further boost their performance. Extensive experimental results on three
popular deductive benchmarks (i.e., ProofWriter, PrOntoQA and PrOntoQA-OOD)
show that COP significantly outperforms previous state-of-the-art methods.
comment: Under Review
☆ Learning Personalized Story Evaluation
While large language models (LLMs) have shown impressive results for more
objective tasks such as QA and retrieval, it remains nontrivial to evaluate
their performance on open-ended text generation for reasons including (1) data
contamination; (2) multi-dimensional evaluation criteria; and (3)
subjectiveness stemming from reviewers' personal preferences. To address such
issues, we propose to model personalization in an uncontaminated open-ended
generation assessment. We create two new datasets Per-MPST and Per-DOC for
personalized story evaluation, by re-purposing existing datasets with proper
anonymization and new personalized labels. We further develop a personalized
story evaluation model PERSE to infer reviewer preferences and provide a
personalized evaluation. Specifically, given a few exemplary reviews from a
particular reviewer, PERSE predicts either a detailed review or fine-grained
comparison in several aspects (such as interestingness and surprise) for that
reviewer on a new text input. Experimental results show that PERSE outperforms
GPT-4 by 15.8% on Kendall correlation of story ratings, and by 13.7% on
pairwise preference prediction accuracy. Both datasets and code will be
released at https://github.com/dqwang122/PerSE.
comment: 19 pages
☆ A New Dialogue Response Generation Agent for Large Language Models by Asking Questions to Detect User's Intentions
Large Language Models (LLMs), such as ChatGPT, have recently been applied to
various NLP tasks due to its open-domain generation capabilities. However,
there are two issues with applying LLMs to dialogue tasks. 1. During the
dialogue process, users may have implicit intentions that might be overlooked
by LLMs. Consequently, generated responses couldn't align with the user's
intentions. 2. It is unlikely for LLMs to encompass all fields comprehensively.
In certain specific domains, their knowledge may be incomplete, and LLMs cannot
update the latest knowledge in real-time. To tackle these issues, we propose a
framework~\emph{using LLM to \textbf{E}nhance dialogue response generation by
asking questions to \textbf{D}etect user's \textbf{I}mplicit
in\textbf{T}entions} (\textbf{EDIT}). Firstly, EDIT generates open questions
related to the dialogue context as the potential user's intention; Then, EDIT
answers those questions by interacting with LLMs and searching in
domain-specific knowledge bases respectively, and use LLMs to choose the proper
answers to questions as extra knowledge; Finally, EDIT enhances response
generation by explicitly integrating those extra knowledge. Besides, previous
question generation works only focus on asking questions with answers in
context. In order to ask open questions, we construct a Context-Open-Question
(COQ) dataset. On two task-oriented dialogue tasks (Wizard of Wikipedia and
Holl-E), EDIT outperformed other LLMs.
☆ A Formalism and Approach for Improving Robustness of Large Language Models Using Risk-Adjusted Confidence Scores
Large Language Models (LLMs), such as ChatGPT, have achieved impressive
milestones in natural language processing (NLP). Despite their impressive
performance, the models are known to pose important risks. As these models are
deployed in real-world applications, a systematic understanding of different
risks posed by these models on tasks such as natural language inference (NLI),
is much needed. In this paper, we define and formalize two distinct types of
risk: decision risk and composite risk. We also propose a risk-centric
evaluation framework, and four novel metrics, for assessing LLMs on these risks
in both in-domain and out-of-domain settings. Finally, we propose a
risk-adjusted calibration method called DwD for helping LLMs minimize these
risks in an overall NLI architecture. Detailed experiments, using four NLI
benchmarks, three baselines and two LLMs, including ChatGPT, show both the
practical utility of the evaluation framework, and the efficacy of DwD in
reducing decision and composite risk. For instance, when using DwD, an
underlying LLM is able to address an extra 20.1% of low-risk inference tasks
(but which the LLM erroneously deems high-risk without risk adjustment) and
skip a further 19.8% of high-risk tasks, which would have been answered
incorrectly.
☆ InstructProtein: Aligning Human and Protein Language via Knowledge Instruction
Large Language Models (LLMs) have revolutionized the field of natural
language processing, but they fall short in comprehending biological sequences
such as proteins. To address this challenge, we propose InstructProtein, an
innovative LLM that possesses bidirectional generation capabilities in both
human and protein languages: (i) taking a protein sequence as input to predict
its textual function description and (ii) using natural language to prompt
protein sequence generation. To achieve this, we first pre-train an LLM on both
protein and natural language corpora, enabling it to comprehend individual
languages. Then supervised instruction tuning is employed to facilitate the
alignment of these two distinct languages. Herein, we introduce a knowledge
graph-based instruction generation framework to construct a high-quality
instruction dataset, addressing annotation imbalance and instruction deficits
in existing protein-text corpus. In particular, the instructions inherit the
structural relations between proteins and function annotations in knowledge
graphs, which empowers our model to engage in the causal modeling of protein
functions, akin to the chain-of-thought processes in natural languages.
Extensive experiments on bidirectional protein-text generation tasks show that
InstructProtein outperforms state-of-the-art LLMs by large margins. Moreover,
InstructProtein serves as a pioneering step towards text-based protein function
prediction and sequence design, effectively bridging the gap between protein
and human language understanding.
☆ Unlock Predictable Scaling from Emergent Abilities
Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun
The scientific scale-up of large language models (LLMs) necessitates a
comprehensive understanding of their scaling properties. However, the existing
literature on the scaling properties only yields an incomplete answer:
optimization loss decreases predictably as the model size increases, in line
with established scaling law; yet no scaling law for task has been established
and the task performances are far from predictable during scaling. Task
performances typically show minor gains on small models until they improve
dramatically once models exceed a size threshold, exemplifying the ``emergent
abilities''. In this study, we discover that small models, although they
exhibit minor performance, demonstrate critical and consistent task performance
improvements that are not captured by conventional evaluation strategies due to
insufficient measurement resolution. To measure such improvements, we introduce
PassUntil, an evaluation strategy through massive sampling in the decoding
phase. We conduct quantitative investigations into the scaling law of task
performance. Firstly, a strict task scaling law is identified, enhancing the
predictability of task performances. Remarkably, we are able to predict the
performance of the 2.4B model on code generation with merely 0.05\% deviation
before training starts. Secondly, underpinned by PassUntil, we observe concrete
evidence of emergent abilities and ascertain that they are not in conflict with
the continuity of performance improvement. Their semblance to break-through is
that their scaling curve cannot be fitted by standard scaling law function. We
then introduce a mathematical definition for the emergent abilities. Through
the definition, we refute a prevalent ``multi-step reasoning hypothesis''
regarding the genesis of emergent abilities and propose a new hypothesis with a
satisfying fit to the observed scaling curve.
comment: 10 pages main paper, 8 pages appendix
☆ Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning
Large language models (LLMs) have achieved remarkable success across a wide
spectrum of tasks; however, they still face limitations in scenarios that
demand long-term planning and spatial reasoning. To facilitate this line of
research, in this work, we propose a new benchmark, termed $\textbf{P}$ath
$\textbf{P}$lanning from $\textbf{N}$atural $\textbf{L}$anguage
($\textbf{PPNL}$). Our benchmark evaluates LLMs' spatial-temporal reasoning by
formulating ''path planning'' tasks that require an LLM to navigate to target
locations while avoiding obstacles and adhering to constraints. Leveraging this
benchmark, we systematically investigate LLMs including GPT-4 via different
few-shot prompting methodologies and BART and T5 of various sizes via
fine-tuning. Our experimental results show the promise of few-shot GPT-4 in
spatial reasoning, when it is prompted to reason and act interleavedly,
although it still fails to make long-term temporal reasoning. In contrast,
while fine-tuned LLMs achieved impressive results on in-distribution reasoning
tasks, they struggled to generalize to larger environments or environments with
more obstacles.
☆ Deep Representations of First-person Pronouns for Prediction of Depression Symptom Severity
Prior work has shown that analyzing the use of first-person singular pronouns
can provide insight into individuals' mental status, especially depression
symptom severity. These findings were generated by counting frequencies of
first-person singular pronouns in text data. However, counting doesn't capture
how these pronouns are used. Recent advances in neural language modeling have
leveraged methods generating contextual embeddings. In this study, we sought to
utilize the embeddings of first-person pronouns obtained from contextualized
language representation models to capture ways these pronouns are used, to
analyze mental status. De-identified text messages sent during online
psychotherapy with weekly assessment of depression severity were used for
evaluation. Results indicate the advantage of contextualized first-person
pronoun embeddings over standard classification token embeddings and
frequency-based pronoun analysis results in predicting depression symptom
severity. This suggests contextual representations of first-person pronouns can
enhance the predictive utility of language used by people with depression
symptoms.
comment: Accepted: AMIA Annual Symposium 2023. To appear as: Ren X, Burkhardt
H, Are\'an P, Hull T, Cohen T. Deep Representations of First-person Pronouns
for Prediction of Depression Symptom Severity. AMIA Annual Symposium
Proceedings 2023. American Medical Informatics Association
☆ FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong
Most large language models (LLMs) are trained once and never updated; thus,
they lack the ability to dynamically adapt to our ever-changing world. In this
work, we perform a detailed study of the factuality of LLM-generated text in
the context of answering questions that test current world knowledge.
Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a
diverse range of question and answer types, including questions that require
fast-changing world knowledge as well as questions with false premises that
need to be debunked. We benchmark a diverse array of both closed and
open-source LLMs under a two-mode evaluation procedure that allows us to
measure both correctness and hallucination. Through human evaluations involving
more than 50K judgments, we shed light on limitations of these models and
demonstrate significant room for improvement: for instance, all models
(regardless of model size) struggle on questions that involve fast-changing
knowledge and false premises. Motivated by these results, we present
FreshPrompt, a simple few-shot prompting method that substantially boosts the
performance of an LLM on FreshQA by incorporating relevant and up-to-date
information retrieved from a search engine into the prompt. Our experiments
show that FreshPrompt outperforms both competing search engine-augmented
prompting methods such as Self-Ask (Press et al., 2022) as well as commercial
systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that
both the number of retrieved evidences and their order play a key role in
influencing the correctness of LLM-generated answers. Additionally, instructing
the LLM to generate concise and direct answers helps reduce hallucination
compared to encouraging more verbose answers. To facilitate future work, we
release FreshQA at github.com/freshllms/freshqa and commit to updating it at
regular intervals.
comment: Preprint, 22 pages, 7 figures, 5 tables
♻ ☆ DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning
Jiong Xiong, Zixuan Li, Chuanyang Zheng, Zhijiang Guo, Yichun Yin, Enze Xie, Zhicheng Yang, Qingxing Cao, Haiming Wang, Xiongwei Han, Jing Tang, Chengming Li, Xiaodan Liang
Recent advances in natural language processing, primarily propelled by Large
Language Models (LLMs), have showcased their remarkable capabilities grounded
in in-context learning. A promising avenue for guiding LLMs in intricate
reasoning tasks involves the utilization of intermediate reasoning steps within
the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies
in the effective selection of exemplars for facilitating in-context learning.
In this study, we introduce a framework that leverages Dual Queries and
Low-rank approximation Re-ranking (DQ-LoRe) to automatically select exemplars
for in-context learning. Dual Queries first query LLM to obtain LLM-generated
knowledge such as CoT, then query the retriever to obtain the final exemplars
via both question and the knowledge. Moreover, for the second query, LoRe
employs dimensionality reduction techniques to refine exemplar selection,
ensuring close alignment with the input question's knowledge. Through extensive
experiments, we demonstrate that DQ-LoRe significantly outperforms prior
state-of-the-art methods in the automatic selection of exemplars for GPT-4,
enhancing performance from 92.5\% to 94.2\%. Our comprehensive analysis further
reveals that DQ-LoRe consistently outperforms retrieval-based approaches in
terms of both performance and adaptability, especially in scenarios
characterized by distribution shifts. DQ-LoRe pushes the boundaries of
in-context learning and opens up new avenues for addressing complex reasoning
challenges. We will release the code soon.
♻ ☆ Explaining Emergent In-Context Learning as Kernel Regression
Large language models (LLMs) have initiated a paradigm shift in transfer
learning. In contrast to the classic pretraining-then-finetuning procedure, in
order to use LLMs for downstream prediction tasks, one only needs to provide a
few demonstrations, known as in-context examples, without adding more or
updating existing model parameters. This in-context learning (ICL) capability
of LLMs is intriguing, and it is not yet fully understood how pretrained LLMs
acquire such capabilities. In this paper, we investigate the reason why a
transformer-based language model can accomplish in-context learning after
pre-training on a general language corpus by proposing one hypothesis that LLMs
can simulate kernel regression with internal representations when faced with
in-context examples. More concretely, we first prove that Bayesian inference on
in-context prompts can be asymptotically understood as kernel regression $\hat
y = \sum_i y_i K(x, x_i)/\sum_i K(x, x_i)$ as the number of in-context
demonstrations grows. Then, we empirically investigate the in-context behaviors
of language models. We find that during ICL, the attention and hidden features
in LLMs match the behaviors of a kernel regression. Finally, our theory
provides insights into multiple phenomena observed in the ICL field: why
retrieving demonstrative samples similar to test samples can help, why ICL
performance is sensitive to the output formats, and why ICL accuracy benefits
from selecting in-distribution and representative samples.
comment: 9 pages, 4 figures
♻ ☆ LC-Score: Reference-less estimation of Text Comprehension Difficulty
Being able to read and understand written text is critical in a digital era.
However, studies shows that a large fraction of the population experiences
comprehension issues. In this context, further initiatives in accessibility are
required to improve the audience text comprehension. However, writers are
hardly assisted nor encouraged to produce easy-to-understand content. Moreover,
Automatic Text Simplification (ATS) model development suffers from the lack of
metric to accurately estimate comprehension difficulty We present
\textsc{LC-Score}, a simple approach for training text comprehension metric for
any French text without reference \ie predicting how easy to understand a given
text is on a $[0, 100]$ scale. Our objective with this scale is to
quantitatively capture the extend to which a text suits to the \textit{Langage
Clair} (LC, \textit{Clear Language}) guidelines, a French initiative closely
related to English Plain Language. We explore two approaches: (i) using
linguistically motivated indicators used to train statistical models, and (ii)
neural learning directly from text leveraging pre-trained language models. We
introduce a simple proxy task for comprehension difficulty training as a
classification task. To evaluate our models, we run two distinct human
annotation experiments, and find that both approaches (indicator based and
neural) outperforms commonly used readability and comprehension metrics such as
FKGL.
♻ ☆ Towards Inferential Reproducibility of Machine Learning Research ICLR 2023
Reliability of machine learning evaluation -- the consistency of observed
evaluation scores across replicated model training runs -- is affected by
several sources of nondeterminism which can be regarded as measurement noise.
Current tendencies to remove noise in order to enforce reproducibility of
research results neglect inherent nondeterminism at the implementation level
and disregard crucial interaction effects between algorithmic noise factors and
data properties. This limits the scope of conclusions that can be drawn from
such experiments. Instead of removing noise, we propose to incorporate several
sources of variance, including their interaction with data properties, into an
analysis of significance and reliability of machine learning evaluation, with
the aim to draw inferences beyond particular instances of trained models. We
show how to use linear mixed effects models (LMEMs) to analyze performance
evaluation scores, and to conduct statistical inference with a generalized
likelihood ratio test (GLRT). This allows us to incorporate arbitrary sources
of noise like meta-parameter variations into statistical significance testing,
and to assess performance differences conditional on data properties.
Furthermore, a variance component analysis (VCA) enables the analysis of the
contribution of noise sources to overall variance and the computation of a
reliability coefficient by the ratio of substantial to total variance.
comment: Published at ICLR 2023
♻ ☆ Large-scale investigation of weakly-supervised deep learning for the fine-grained semantic indexing of biomedical literature
Anastasios Nentidis, Thomas Chatzopoulos, Anastasia Krithara, Grigorios Tsoumakas, Georgios Paliouras
Objective: Semantic indexing of biomedical literature is usually done at the
level of MeSH descriptors with several related but distinct biomedical concepts
often grouped together and treated as a single topic. This study proposes a new
method for the automated refinement of subject annotations at the level of MeSH
concepts. Methods: Lacking labelled data, we rely on weak supervision based on
concept occurrence in the abstract of an article, which is also enhanced by
dictionary-based heuristics. In addition, we investigate deep learning
approaches, making design choices to tackle the particular challenges of this
task. The new method is evaluated on a large-scale retrospective scenario,
based on concepts that have been promoted to descriptors. Results: In our
experiments concept occurrence was the strongest heuristic achieving a macro-F1
score of about 0.63 across several labels. The proposed method improved it
further by more than 4pp. Conclusion: The results suggest that concept
occurrence is a strong heuristic for refining the coarse-grained labels at the
level of MeSH concepts and the proposed method improves it further.
comment: 26 pages, 5 figures, 4 tables. A more concise version
♻ ☆ SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning
The recent progress in large language models (LLMs), especially the invention
of chain-of-thought prompting, has made it possible to automatically answer
questions by stepwise reasoning. However, when faced with more complicated
problems that require non-linear thinking, even the strongest LLMs make
mistakes. To address this, we explore whether LLMs are able to recognize errors
in their own step-by-step reasoning, without resorting to external resources.
To this end, we propose SelfCheck, a general-purpose zero-shot verification
schema for recognizing such errors. We then use the results of these checks to
improve question-answering performance by conducting weighted voting on
multiple solutions to the question. We test SelfCheck on three datasets (GSM8K,
MathQA, and MATH) and find that it successfully recognizes errors and, in turn,
increases final answer accuracies.
♻ ☆ On the definition of toxicity in NLP
The fundamental problem in toxicity detection task lies in the fact that the
toxicity is ill-defined. This causes us to rely on subjective and vague data in
models' training, which results in non-robust and non-accurate results: garbage
in - garbage out.
This work suggests a new, stress-level-based definition of toxicity designed
to be objective and context-aware. On par with it, we also describe possible
ways of applying this new definition to dataset creation and model training.
♻ ☆ Using Large Language Models for Qualitative Analysis can Introduce Serious Bias
Large Language Models (LLMs) are quickly becoming ubiquitous, but the
implications for social science research are not yet well understood. This
paper asks whether LLMs can help us analyse large-N qualitative data from
open-ended interviews, with an application to transcripts of interviews with
Rohingya refugees in Cox's Bazaar, Bangladesh. We find that a great deal of
caution is needed in using LLMs to annotate text as there is a risk of
introducing biases that can lead to misleading inferences. We here mean bias in
the technical sense, that the errors that LLMs make in annotating interview
transcripts are not random with respect to the characteristics of the interview
subjects. Training simpler supervised models on high-quality human annotations
with flexible coding leads to less measurement error and bias than LLM
annotations. Therefore, given that some high quality annotations are necessary
in order to asses whether an LLM introduces bias, we argue that it is probably
preferable to train a bespoke model on these annotations than it is to use an
LLM for annotation.
♻ ☆ BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR INTERSPEECH 2023
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
comment: Accepted by INTERSPEECH 2023
♻ ☆ Modeling Human Sentence Processing with Left-Corner Recurrent Neural Network Grammars EMNLP 2021
In computational linguistics, it has been shown that hierarchical structures
make language models (LMs) more human-like. However, the previous literature
has been agnostic about a parsing strategy of the hierarchical models. In this
paper, we investigated whether hierarchical structures make LMs more
human-like, and if so, which parsing strategy is most cognitively plausible. In
order to address this question, we evaluated three LMs against human reading
times in Japanese with head-final left-branching structures: Long Short-Term
Memory (LSTM) as a sequential model and Recurrent Neural Network Grammars
(RNNGs) with top-down and left-corner parsing strategies as hierarchical
models. Our computational modeling demonstrated that left-corner RNNGs
outperformed top-down RNNGs and LSTM, suggesting that hierarchical and
left-corner architectures are more cognitively plausible than top-down or
sequential architectures. In addition, the relationships between the cognitive
plausibility and (i) perplexity, (ii) parsing, and (iii) beam size will also be
discussed.
comment: Accepted by EMNLP 2021
♻ ☆ A Comprehensive Overview of Large Language Models
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Mian
Large Language Models (LLMs) have recently demonstrated remarkable
capabilities in natural language processing tasks and beyond. This success of
LLMs has led to a large influx of research contributions in this direction.
These works encompass diverse topics such as architectural innovations of the
underlying neural networks, context length improvements, model alignment,
training datasets, benchmarking, efficiency and more. With the rapid
development of techniques and regular breakthroughs in LLM research, it has
become considerably challenging to perceive the bigger picture of the advances
in this direction. Considering the rapidly emerging plethora of literature on
LLMs, it is imperative that the research community is able to benefit from a
concise yet comprehensive overview of the recent developments in this field.
This article provides that overview to the research community. It not only
focuses on a systematic treatment of the existing literature on a broad range
of LLM related concept, but also pays special attention to providing
comprehensive summaries with extensive details about the individual existing
models, datasets and major insights. We also pay heed to aligning our overview
with the emerging outlook of this research direction by accounting for the
other recently materializing reviews of the broader research direction of LLMs.
Our self-contained comprehensive overview of LLMs discusses relevant background
concepts along with covering the advanced topics at the frontier of this
research direction. This review article is intended to not only provide a
systematic survey, but also a quick comprehensive reference for the researchers
and practitioners to draw insights from extensive informative summaries of the
existing works to advance the LLM research direction.
comment: Work in-progress
♻ ☆ Bridging Emotion Role Labeling and Appraisal-based Emotion Analysis
The term emotion analysis in text subsumes various natural language
processing tasks which have in common the goal to enable computers to
understand emotions. Most popular is emotion classification in which one or
multiple emotions are assigned to a predefined textual unit. While such setting
is appropriate to identify the reader's or author's emotion, emotion role
labeling adds the perspective of mentioned entities and extracts text spans
that correspond to the emotion cause. The underlying emotion theories agree on
one important point; that an emotion is caused by some internal or external
event and comprises several subcomponents, including the subjective feeling and
a cognitive evaluation. We therefore argue that emotions and events are related
in two ways. (1) Emotions are events; and this perspective is the fundament in
NLP for emotion role labeling. (2) Emotions are caused by events; a perspective
that is made explicit with research how to incorporate psychological appraisal
theories in NLP models to interpret events. These two research directions, role
labeling and (event-focused) emotion classification, have by and large been
tackled separately. We contributed to both directions with the projects SEAT
(Structured Multi-Domain Emotion Analysis from Text) and CEAT (Computational
Event Evaluation based on Appraisal Theories for Emotion Analysis), both funded
by the German Research Foundation. In this paper, we consolidate the findings
and discuss open research directions.
comment: accepted to the Big Picture Workshop
(https://bigpictureworkshop.com/)
♻ ☆ Teaching Large Language Models to Self-Debug
Large language models (LLMs) have achieved impressive performance on code
generation. However, for complex programming tasks, generating the correct
solution in one go becomes challenging, thus some prior works have designed
program repair approaches to improve code generation performance. In this work,
we propose Self-Debugging, which teaches a large language model to debug its
predicted program via few-shot demonstrations. In particular, we demonstrate
that Self-Debugging can teach the large language model to perform rubber duck
debugging; i.e., without any human feedback on the code correctness or error
messages, the model is able to identify its mistakes by investigating the
execution results and explaining the generated code in natural language.
Self-Debugging achieves the state-of-the-art performance on several code
generation benchmarks, including the Spider dataset for text-to-SQL generation,
TransCoder for C++-to-Python translation, and MBPP for text-to-Python
generation. On the Spider benchmark where there are no unit tests to verify the
correctness of predictions, Self-Debugging with code explanation consistently
improves the baseline by 2-3%, and improves the prediction accuracy on problems
of the hardest level by 9%. On TransCoder and MBPP where unit tests are
available, Self-Debugging improves the baseline accuracy by up to 12%.
Meanwhile, by leveraging feedback messages and reusing failed predictions,
Self-Debugging notably improves sample efficiency, and can match or outperform
baseline models that generate more than 10x candidate programs.
♻ ☆ DyVal: Graph-informed Dynamic Evaluation of Large Language Models
Large language models (LLMs) have achieved remarkable performance in various
evaluation benchmarks. However, concerns about their performance are raised on
potential data contamination in their considerable volume of training corpus.
Moreover, the static nature and fixed complexity of current benchmarks may
inadequately gauge the advancing capabilities of LLMs. In this paper, we
introduce DyVal, a novel, general, and flexible evaluation protocol for dynamic
evaluation of LLMs. Based on our proposed dynamic evaluation framework, we
build graph-informed DyVal by leveraging the structural advantage of directed
acyclic graphs to dynamically generate evaluation samples with controllable
complexities. DyVal generates challenging evaluation sets on reasoning tasks
including mathematics, logical reasoning, and algorithm problems. We evaluate
various LLMs ranging from Flan-T5-large to ChatGPT and GPT4. Experiments
demonstrate that LLMs perform worse in DyVal-generated evaluation samples with
different complexities, emphasizing the significance of dynamic evaluation. We
also analyze the failure cases and results of different prompting methods.
Moreover, DyVal-generated samples are not only evaluation sets, but also
helpful data for fine-tuning to improve the performance of LLMs on existing
benchmarks. We hope that DyVal can shed light on the future evaluation research
of LLMs.
comment: Technical report; 36 pages; code will be released at aka.ms/dyval
♻ ☆ Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions
Annotating data via crowdsourcing is time-consuming and expensive. Due to
these costs, dataset creators often have each annotator label only a small
subset of the data. This leads to sparse datasets with examples that are marked
by few annotators. The downside of this process is that if an annotator doesn't
get to label a particular example, their perspective on it is missed. This is
especially concerning for subjective NLP datasets where there is no single
correct label: people may have different valid opinions. Thus, we propose using
imputation methods to generate the opinions of all annotators for all examples,
creating a dataset that does not leave out any annotator's view. We then train
and prompt models, using data from the imputed dataset, to make predictions
about the distribution of responses and individual annotations.
In our analysis of the results, we found that the choice of imputation method
significantly impacts soft label changes and distribution. While the imputation
introduces noise in the prediction of the original dataset, it has shown
potential in enhancing shots for prompts, particularly for low-response-rate
annotators. We have made all of our code and data publicly available.
comment: NLPerspectives - 2nd Workshop on Perspectivist Approaches to NLP, 39
pages, 13 figures, 13 tables
♻ ☆ GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation CIKM 2023
Ji Qi, Jifan Yu, Teng Tu, Kunyu Gao, Yifan Xu, Xinyu Guan, Xiaozhi Wang, Yuxiao Dong, Bin Xu, Lei Hou, Juanzi Li, Jie Tang, Weidong Guo, Hui Liu, Yu Xu
Despite the recent emergence of video captioning models, how to generate
vivid, fine-grained video descriptions based on the background knowledge (i.e.,
long and informative commentary about the domain-specific scenes with
appropriate reasoning) is still far from being solved, which however has great
applications such as automatic sports narrative. In this paper, we present
GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k
knowledge triples for proposing a challenging new task setting as
Knowledge-grounded Video Captioning (KGVC). Moreover, we conduct experimental
adaption of existing methods to show the difficulty and potential directions
for solving this valuable and applicable task. Our data and code are available
at https://github.com/THU-KEG/goal.
comment: Accepted by CIKM 2023
♻ ☆ Ring Attention with Blockwise Transformers for Near-Infinite Context
Transformers have emerged as the architecture of choice for many
state-of-the-art AI models, showcasing exceptional performance across a wide
range of AI applications. However, the memory demands imposed by Transformers
limit their ability to handle long sequences, thereby creating challenges for
tasks involving extended sequences or long-term dependencies. We present a
distinct approach, Ring Attention, which leverages blockwise computation of
self-attention to distribute long sequences across multiple devices while
concurrently overlapping the communication of key-value blocks with the
computation of blockwise attention. By processing longer input sequences while
maintaining memory efficiency, Ring Attention enables training and inference of
sequences that are device count times longer than those of prior
memory-efficient Transformers, effectively eliminating the memory constraints
imposed by individual devices. Extensive experiments on language modeling tasks
demonstrate the effectiveness of Ring Attention in allowing large sequence
input size and improving performance.
♻ ☆ Revisiting the Role of Language Priors in Vision-Language Models
Vision-language models (VLMs) are impactful in part because they can be
applied to a variety of visual understanding tasks in a zero-shot fashion,
without any fine-tuning. We study $\textit{generative VLMs}$ that are trained
for next-word generation given an image. We explore their zero-shot performance
on the illustrative task of image-text retrieval across 8 popular
vision-language benchmarks. Our first observation is that they can be
repurposed for discriminative tasks (such as image-text retrieval) by simply
computing the match score of generating a particular text string given an
image. We call this probabilistic score the $\textit{Visual Generative
Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces
near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on
others. We analyze this behavior through a probabilistic lens, pointing out
that some benchmarks inadvertently capture unnatural language distributions by
creating adversarial but unlikely text captions. In fact, we demonstrate that
even a "blind" language model that ignores any image evidence can sometimes
outperform all prior art, reminiscent of similar challenges faced by the
visual-question answering (VQA) community many years ago. We derive a
probabilistic post-processing scheme that controls for the amount of linguistic
bias in generative VLMs at test time without having to retrain or fine-tune the
model. We show that the VisualGPTScore, when appropriately debiased, is a
strong zero-shot baseline for vision-language understanding, oftentimes
producing state-of-the-art accuracy.
comment: Website: https://linzhiqiu.github.io/papers/visual_gpt_score/ Code:
https://github.com/linzhiqiu/visual_gpt_score/
♻ ☆ Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
Shenggui Li, Hongxin Liu, Zhengda Bian, Jiarui Fang, Haichen Huang, Yuliang Liu, Boxiang Wang, Yang You
The success of Transformer models has pushed the deep learning model scale to
billions of parameters. Due to the limited memory resource of a single GPU,
However, the best practice for choosing the optimal parallel strategy is still
lacking, since it requires domain expertise in both deep learning and parallel
computing.
The Colossal-AI system addressed the above challenge by introducing a unified
interface to scale your sequential code of model training to distributed
environments. It supports parallel training methods such as data, pipeline,
tensor, and sequence parallelism, as well as heterogeneous training methods
integrated with zero redundancy optimizer. Compared to the baseline system,
Colossal-AI can achieve up to 2.76 times training speedup on large-scale
models.
♻ ☆ TADIS: Steering Models for Deep-Thinking about Demonstration Examples
Instruction tuning has been demonstrated that could significantly improve the
zero-shot generalization capability to unseen tasks by an apparent margin. By
incorporating additional context (e.g., task definition, examples) during the
fine-tuning process, Large Language Models (LLMs) achieved much higher
performance than before. However, recent work reported that delusive task
examples can achieve almost the same performance as correct task examples,
indicating the input-label correspondence is less important than previously
thought. Intrigued by this counter-intuitive observation, we suspect models
have the same illusion of competence as humans. Therefore, we propose a novel
method called TADIS that steers LLMs for "Deep-Thinking'' about demonstration
examples instead of merely seeing. To alleviate the illusion of competence of
models, we first ask the model to verify the correctness of shown examples.
Then, using the verification results as conditions to elicit models for a
better answer. Our experimental results show that TADIS consistently
outperforms competitive baselines on in-domain and out-domain tasks (improving
2.79 and 4.03 average ROUGLE-L on out-domain and in-domain datasets,
respectively). Despite the presence of generated examples (not all of the
thinking labels are accurate), TADIS can notably enhance performance in
zero-shot and few-shot settings. This also suggests that our approach can be
adopted on a large scale to improve the instruction following capabilities of
models without any manual labor. Moreover, we construct three types of thinking
labels with different model sizes and find that small models learn from the
format of TADIS but larger models can be steered for "Deep-Thinking''.
comment: 14 pages, 3 figures
♻ ☆ Rethinking the Evaluating Framework for Natural Language Understanding in AI Systems: Language Acquisition as a Core for Future Metrics
In the burgeoning field of artificial intelligence (AI), the unprecedented
progress of large language models (LLMs) in natural language processing (NLP)
offers an opportunity to revisit the entire approach of traditional metrics of
machine intelligence, both in form and content. As the realm of machine
cognitive evaluation has already reached Imitation, the next step is an
efficient Language Acquisition and Understanding. Our paper proposes a paradigm
shift from the established Turing Test towards an all-embracing framework that
hinges on language acquisition, taking inspiration from the recent advancements
in LLMs. The present contribution is deeply tributary of the excellent work
from various disciplines, point out the need to keep interdisciplinary bridges
open, and delineates a more robust and sustainable approach.
comment: 14 pages, 1 table, 2 figures
♻ ☆ AnglE-optimized Text Embeddings
High-quality text embedding is pivotal in improving semantic textual
similarity (STS) tasks, which are crucial components in Large Language Model
(LLM) applications. However, a common challenge existing text embedding models
face is the problem of vanishing gradients, primarily due to their reliance on
the cosine function in the optimization objective, which has saturation zones.
To address this issue, this paper proposes a novel angle-optimized text
embedding model called AnglE. The core idea of AnglE is to introduce angle
optimization in a complex space. This novel approach effectively mitigates the
adverse effects of the saturation zone in the cosine function, which can impede
gradient and hinder optimization processes. To set up a comprehensive STS
evaluation, we experimented on existing short-text STS datasets and a newly
collected long-text STS dataset from GitHub Issues. Furthermore, we examine
domain-specific STS scenarios with limited labeled data and explore how AnglE
works with LLM-annotated data. Extensive experiments were conducted on various
tasks including short-text STS, long-text STS, and domain-specific STS tasks.
The results show that AnglE outperforms the state-of-the-art (SOTA) STS models
that ignore the cosine saturation zone. These findings demonstrate the ability
of AnglE to generate high-quality text embeddings and the usefulness of angle
optimization in STS.
comment: NLP, Text Embedding, Semantic Textual Similarity
♻ ☆ Physics of Language Models: Part 1, Context-Free Grammar
We design controlled experiments to study HOW generative language models,
like GPT, learn context-free grammars (CFGs) -- diverse language systems with a
tree-like structure capturing many aspects of natural languages, programs, and
logics. CFGs are as hard as pushdown automata, and can be ambiguous so that
verifying if a string satisfies the rules requires dynamic programming. We
construct synthetic data and demonstrate that even for difficult (long and
ambiguous) CFGs, pre-trained transformers can learn to generate sentences with
near-perfect accuracy and impressive diversity.
More importantly, we delve into the physical principles behind how
transformers learns CFGs. We discover that the hidden states within the
transformer implicitly and precisely encode the CFG structure (such as putting
tree node information exactly on the subtree boundary), and learn to form
"boundary to boundary" attentions resembling dynamic programming. We also cover
some extension of CFGs as well as the robustness aspect of transformers against
grammar mistakes. Overall, our research provides a comprehensive and empirical
understanding of how transformers learn CFGs, and reveals the physical
mechanisms utilized by transformers to capture the structure and rules of
languages.
comment: V2 polishes writing and adds Appendix G